Section: New Results

Floating-Point arithmetic

On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration

Many numerical problems require a higher computing precision than that offered by common floating-point (FP) formats. One common way of extending the precision is to represent numbers in a multiple-component format. With so-called floating-point expansions, numbers are represented as the unevaluated sum of standard machine-precision FP numbers. This format offers the simplicity of using directly available, highly optimized FP operations and is used by multiple-precision libraries such as Bailey's QD or its Graphics Processing Unit (GPU) tuned counterpart, GQD. Mioara Joldes (LAAS), Jean-Michel Muller, and Valentina Popescu introduced a new algorithm for computing the reciprocal FP expansion $a^{-1}$ of an FP expansion $a$. Their algorithm is based on an adapted Newton-Raphson iteration in which "truncated" operations (additions, multiplications) involving FP expansions are used. The error analysis given shows that their algorithm allows for the computation of very accurate quotients. Precisely, after $i \geq 0$ iterations, the computed FP expansion $x = x_0 + \cdots + x_{2^i-1}$ satisfies the relative error bound $\left|\frac{x - a^{-1}}{a^{-1}}\right| \leq 2^{-2^i(p-3)-1}$, where $p > 4$ is the precision of the FP representation used ($p = 24$ for single precision and $p = 53$ for double precision) [19].
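To make the underlying idea concrete, here is a minimal C sketch (ours, not the expansion algorithms of [19], which handle expansions of arbitrary length with "truncated" operations): it applies the Newton-Raphson reciprocal iteration $x \leftarrow x + x(1 - ax)$ to a 2-term expansion, using the FMA-based error-free product and Fast2Sum. The restriction to two terms and all names are our assumptions.

/* Sketch (illustrative only): Newton-Raphson reciprocal of a 2-term FP
   expansion a = ahi + alo.  This is NOT the algorithm of [19]; it only
   illustrates the underlying iteration x <- x + x*(1 - a*x) combined with
   FMA-based error-free transforms. */
#include <math.h>
#include <stdio.h>

typedef struct { double hi, lo; } dd_t;   /* unevaluated sum hi + lo */

/* Error-free product of two doubles: hi + lo = x*y exactly (needs an FMA). */
static dd_t two_prod(double x, double y) {
    dd_t r = { x * y, 0.0 };
    r.lo = fma(x, y, -r.hi);              /* exact error of the product */
    return r;
}

/* Error-free sum of two doubles, assuming |x| >= |y| (Fast2Sum). */
static dd_t fast_two_sum(double x, double y) {
    dd_t r = { x + y, 0.0 };
    r.lo = y - (r.hi - x);                /* exact error of the sum */
    return r;
}

static dd_t dd_reciprocal(dd_t a) {
    double x = 1.0 / a.hi;                /* starting approximation          */
    /* One Newton-Raphson step, x <- x + x*(1 - a*x), folding in a.lo.       */
    double r = fma(-a.hi, x, 1.0);        /* ~ 1 - a.hi*x, one rounding only */
    r = fma(-a.lo, x, r);                 /* ~ 1 - (a.hi + a.lo)*x           */
    x = fma(x, r, x);
    /* A second residual provides the low-order term of the result.          */
    r = fma(-a.hi, x, 1.0);
    r = fma(-a.lo, x, r);
    return fast_two_sum(x, r * x);        /* 2-term approximation of 1/a     */
}

int main(void) {
    dd_t a = two_prod(3.0, 0.1);          /* a ~ 0.3 with about 106 bits     */
    dd_t inv = dd_reciprocal(a);
    printf("1/a ~ %.17g + %.17g\n", inv.hi, inv.lo);
    return 0;
}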

Error bounds on complex floating-point multiplication with a fused multiply-add

The accuracy analysis of complex floating-point multiplication done by Brent, Percival, and Zimmermann [Math. Comp., 76:1469–1481, 2007] is extended by Peter Kornerup (Odense Univ., Denmark), Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller [42] to the case where a fused multiply-add (FMA) operation is available. Considering floating-point arithmetic with rounding to nearest and unit roundoff $u$, they show that the bound $\sqrt{5}\,u$ on the normwise relative error $|\hat z/z - 1|$ of a complex product $z$ can be decreased further to $2u$ when using the FMA in the most naive way. Furthermore, they prove that the term $2u$ is asymptotically optimal not only for this naive FMA-based algorithm but also for two other algorithms, which use the FMA operation as an efficient way of implementing rounding error compensation. Thus, although highly accurate in the componentwise sense, these two compensated algorithms bring no improvement to the normwise accuracy $2u$ already achieved using the FMA naively. Asymptotic optimality is established for each algorithm thanks to the explicit construction of floating-point inputs for which it is proven that the normwise relative error generated satisfies $|\hat z/z - 1| \sim 2u$ as $u \to 0$. All these results hold for IEEE floating-point arithmetic, with radix $\beta \geq 2$, precision $p \geq 2$, and rounding to nearest; it is only assumed that underflows and overflows do not occur and, when bounding errors from below, that $\beta^{p-1} \geq 12$.
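For illustration, the following C sketch (ours, not code from [42]) shows what "using the FMA in the most naive way" amounts to: each component of the product is obtained with one rounded multiplication and one FMA, so each part is affected by two roundings instead of three.

/* Sketch: complex product (a+ib)(c+id) with the FMA used "naively".
   Variable names are ours. */
#include <math.h>
#include <stdio.h>

typedef struct { double re, im; } cplx;

static cplx cmul_fma(cplx x, cplx y) {
    cplx z;
    z.re = fma(x.re, y.re, -(x.im * y.im));   /* RN(ac - RN(bd)) */
    z.im = fma(x.re, y.im,   x.im * y.re );   /* RN(ad + RN(bc)) */
    return z;
}

int main(void) {
    cplx x = { 1.0 / 3.0, 2.0 / 3.0 }, y = { 5.0 / 7.0, -1.0 / 7.0 };
    cplx z = cmul_fma(x, y);
    printf("z = %.17g + %.17g*i\n", z.re, z.im);
    return 0;
}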

Refined error analysis of the Cornea-Harrison-Tang method for ab+cd

In their book Scientific Computing on Itanium-based Systems, Cornea, Harrison, and Tang introduced an accurate algorithm for evaluating expressions of the form $ab+cd$ in binary floating-point arithmetic, assuming a fused multiply-add instruction is available. They showed that if $p$ is the precision of the floating-point format and $u = 2^{-p}$, the relative error of the result is of order $u$. Jean-Michel Muller improved their proof to show that the relative error is bounded by $2u + 7u^2 + 6u^3$. Furthermore, by building an example for which the relative error is asymptotically (as $p \to \infty$ or, equivalently, as $u \to 0$) equivalent to $2u$, he proved that this error bound is asymptotically optimal [11]. Claude-Pierre Jeannerod then showed in [41] that an error bound of the form $2u + 2u^2 + O(u^3)$ in fact holds for any radix $\beta \geq 2$, with $u = \frac{1}{2}\beta^{1-p}$. He also showed that the possibility of removing the $O(u^2)$ term from this bound depends on the radix parity and the tie-breaking strategy used for rounding: if $\beta$ is odd or rounding is to nearest even, then the simpler bound $2u$ is obtained, while if $\beta$ is even and rounding is to nearest away, then there exist floating-point inputs $a, b, c, d$ that lead to a relative error larger than $2u + \frac{1}{\beta}u^2$.
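As a reference point, here is a C sketch of the Cornea-Harrison-Tang evaluation scheme as it is usually presented (variable names and the test case are ours): each product is computed together with its exact rounding error via an FMA, and the two error terms are added back at the end.

/* Sketch of the Cornea-Harrison-Tang scheme for ab + cd, as commonly
   described.  Variable names are ours. */
#include <math.h>
#include <stdio.h>

static double cht_ab_plus_cd(double a, double b, double c, double d) {
    double p1 = a * b;                 /* RN(ab)                      */
    double e1 = fma(a, b, -p1);        /* ab - RN(ab), exact          */
    double p2 = c * d;                 /* RN(cd)                      */
    double e2 = fma(c, d, -p2);        /* cd - RN(cd), exact          */
    return (p1 + p2) + (e1 + e2);      /* recombine products + errors */
}

int main(void) {
    /* An ill-conditioned case: ab and cd cancel almost completely. */
    double a = 1.0 + 0x1p-27, b = 1.0 - 0x1p-27, c = -1.0, d = 1.0;
    printf("naive : %.17g\n", a * b + c * d);
    printf("CHT   : %.17g\n", cht_ab_plus_cd(a, b, c, d));
    return 0;
}

On these inputs the naive expression prints 0, while the compensated scheme returns the exact result $-2^{-54}$.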

On the maximum relative error when computing integer powers by iterated multiplications in floating-point arithmetic

Stef Graillat (Paris 6 University), Vincent Lefèvre, and Jean-Michel Muller improved the usual relative error bound for the computation of $x^n$ through iterated multiplications by $x$ in binary floating-point arithmetic. The obtained error bound is only slightly better than the usual one, but it is simpler. They also discussed the more general problem of computing the product of $n$ terms [7].
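The evaluation scheme under study is simply the following (a straightforward sketch, ours): $x^n$ obtained with $n-1$ rounded multiplications, so the computed value is $x^n(1+\delta_1)\cdots(1+\delta_{n-1})$ with $|\delta_i| \leq u$.

/* Sketch: x^n by n-1 iterated floating-point multiplications, the scheme
   whose relative error bound is discussed above. */
#include <stdio.h>

static double pow_iter(double x, unsigned n) {   /* assumes n >= 1 */
    double r = x;
    for (unsigned i = 1; i < n; ++i)
        r *= x;                                  /* one rounding per step */
    return r;
}

int main(void) {
    printf("1.0000001^100 ~ %.17g\n", pow_iter(1.0000001, 100));
    return 0;
}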

Improved error bounds for numerical linear algebra

When computing matrix factorizations and solving linear systems in floating-point arithmetic, classical rounding error analyses provide backward error bounds whose leading terms have the form $\gamma_n = nu/(1-nu)$ for suitable values of $n$ and with $u$ the unit roundoff. With Siegfried M. Rump (Hamburg University of Technology), Claude-Pierre Jeannerod showed in [13] that for LU and Cholesky factorizations as well as for triangular system solving, $\gamma_n$ can be replaced by the $O(u^2)$-free and unconditional constant $nu$. To get these new bounds, the main ingredient is a general framework for bounding expressions of the form $|\rho - s|$, where $s$ is the exact sum of a floating-point number and $n-1$ real numbers, and where $\rho$ is a real number approximating the computed sum $\hat s$.
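As a rough numerical illustration (ours, not taken from [13]), the snippet below prints the classical constant $\gamma_n = nu/(1-nu)$ next to the simpler constant $nu$ for binary64: the two are close for moderate $n$, but $\gamma_n$ is only defined when $nu < 1$, whereas $nu$ is unconditional.

/* Illustration (ours): gamma_n = nu/(1-nu) versus the constant nu of [13],
   for binary64 (u = 2^-53). */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double u = 0x1p-53;                 /* unit roundoff of binary64 */
    const double n_values[] = { 1e3, 1e6, 1e9, 1e16 };
    for (int i = 0; i < 4; ++i) {
        double n  = n_values[i];
        double nu = n * u;
        double gamma_n = (nu < 1.0) ? nu / (1.0 - nu) : INFINITY;
        printf("n = %.0e:  gamma_n = %.6e   n*u = %.6e\n", n, gamma_n, nu);
    }
    return 0;
}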

On relative errors of floating-point operations

Rounding error analyses of numerical algorithms are most often carried out via repeated applications of the so-called standard models of floating-point arithmetic. Given a round-to-nearest function $\mathrm{RN}$ and barring underflow and overflow, such models bound the relative errors $E_1(t) = |t - \mathrm{RN}(t)|/|t|$ and $E_2(t) = |t - \mathrm{RN}(t)|/|\mathrm{RN}(t)|$ by the unit roundoff $u$. In [34] Claude-Pierre Jeannerod and Siegfried M. Rump (Hamburg University of Technology) investigated the possibility of refining these bounds, both in the case of an arbitrary real $t$ and in the case where $t$ is the exact result of an arithmetic operation on some floating-point numbers. They provided explicit and attainable bounds on $E_1(t)$, which are all less than or equal to $u/(1+u)$ and, therefore, smaller than $u$. For $E_2(t)$ the bound $u$ is attainable whenever $t = x \pm y$ or $t = xy$ or, in base greater than 2, $t = x/y$ with $x, y$ two floating-point numbers. However, for division in base 2 as well as for square root, smaller bounds are derived, which are also shown to be attainable. This set of sharp bounds was then applied to the rounding error analysis of various numerical algorithms: in all cases, they obtained either much shorter proofs of the best-known error bounds for such algorithms, or improvements on these bounds themselves.
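A small self-contained illustration (ours, not from [34]) of the refined bound on $E_1$: in binary64, with $u = 2^{-53}$, the real number $t = 1 + u$ is the midpoint of the two consecutive floating-point numbers $1$ and $1 + 2u$, so $\mathrm{RN}(t) = 1$ (ties to even) and $E_1(t) = u/(1+u)$. The sketch holds $t$ exactly in a long double, assuming an x86-style long double with at least 64 bits of significand.

/* Illustration (ours): E1(t) = |t - RN(t)|/|t| reaches u/(1+u) at t = 1+u
   in binary64.  Assumes long double is wider than double (as on x86). */
#include <stdio.h>

int main(void) {
    const long double u = 0x1p-53L;       /* unit roundoff of binary64     */
    const long double t = 1.0L + u;       /* exact in x86 long double      */
    const double rn_t = (double)t;        /* round t to nearest binary64   */
    long double e1 = (t - (long double)rn_t) / t;
    printf("RN(t)   = %.17g\n", rn_t);    /* prints 1                      */
    printf("E1(t)   = %.20Lg\n", e1);
    printf("u/(1+u) = %.20Lg\n", u / (1.0L + u));
    return 0;
}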

Comparison between binary and decimal floating-point numbers

In collaboration with Christoph Lauter and Marc Mezzarobba (LIP6 laboratory, Paris), Nicolas Brisebarre and Jean-Michel Muller introduced an algorithm to compare a binary floating-point (FP) number and a decimal FP number, assuming the "binary encoding" of the decimal formats is used, and with a special emphasis on the basic interchange formats specified by the IEEE 754-2008 standard for FP arithmetic. It is a two-step algorithm: a first pass, based on the exponents only, quickly eliminates most cases; when the first pass does not suffice, a more accurate second pass is performed. They provided an implementation of several variants of the algorithm and compared them [37].
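The two-step structure can be illustrated with the following simplified C sketch (ours; it is not the algorithm of [37], and the GMP-based second pass is only one possible exact fallback): a positive binary64 value $x$ is compared with a positive decimal value $m \cdot 10^e$, first using only exponent information, then exactly when the first pass is inconclusive. Compile with -lgmp -lm.

/* Simplified sketch (ours, NOT the algorithm of [37]) of the two-step idea:
   compare a positive binary64 x with a positive decimal value m * 10^e.
   Step 1 uses only exponent information; step 2 is an exact integer
   comparison using GMP. */
#include <gmp.h>
#include <math.h>
#include <stdio.h>
#include <stdint.h>

/* Returns -1, 0, or +1 for x <?> m * 10^e, with x > 0 and m > 0. */
static int cmp_bin_dec(double x, unsigned long m, int e) {
    /* Decompose x exactly as mx * 2^ex with mx a 53-bit integer. */
    int bx;
    double fr = frexp(x, &bx);               /* x = fr * 2^bx, fr in [0.5,1) */
    int64_t mx = (int64_t)ldexp(fr, 53);     /* exact: x = mx * 2^(bx-53)    */
    int ex = bx - 53;

    /* Step 1: x is in [2^(bx-1), 2^bx); m*10^e is in [10^(e+d-1), 10^(e+d)),
       d being the number of decimal digits of m.  Compare these ranges in
       log2, with a generous margin so the test is always safe.            */
    int d = 0;
    for (unsigned long t = m; t > 0; t /= 10) d++;
    double lo_dec = (e + d - 1) * 3.321928094887362;   /* ~ log2(10^(e+d-1)) */
    double hi_dec = (e + d)     * 3.321928094887362;   /* ~ log2(10^(e+d))   */
    if ((double)bx       <= lo_dec - 1.0) return -1;   /* clearly smaller    */
    if ((double)(bx - 1) >= hi_dec + 1.0) return +1;   /* clearly larger     */

    /* Step 2: exact comparison of mx * 2^ex and m * 10^e. */
    mpz_t lhs, rhs, pow10;
    mpz_init_set_si(lhs, (long)mx);
    mpz_init_set_ui(rhs, m);
    mpz_init(pow10);
    if (ex >= 0) mpz_mul_2exp(lhs, lhs, (unsigned long)ex);
    else         mpz_mul_2exp(rhs, rhs, (unsigned long)(-ex));
    mpz_ui_pow_ui(pow10, 10, (unsigned long)(e >= 0 ? e : -e));
    if (e >= 0) mpz_mul(rhs, rhs, pow10);
    else        mpz_mul(lhs, lhs, pow10);
    int r = mpz_cmp(lhs, rhs);
    mpz_clears(lhs, rhs, pow10, NULL);
    return (r > 0) - (r < 0);
}

int main(void) {
    /* Binary64 0.1 is slightly larger than the decimal 0.1 = 1 * 10^-1. */
    printf("%d\n", cmp_bin_dec(0.1, 1, -1));    /* +1, decided by step 2 */
    printf("%d\n", cmp_bin_dec(2.0, 3, 0));     /* -1, decided by step 2 */
    printf("%d\n", cmp_bin_dec(1e10, 3, -2));   /* +1, decided by step 1 */
    return 0;
}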

Correctly rounded sum of floating-point numbers in GNU MPFR

Vincent Lefèvre has designed a new algorithm to compute the correctly rounded sum of several floating-point numbers, each input having its own precision and the output having its own precision, as in GNU MPFR. At the same time, the mpfr_sum function is being reimplemented (this work is not finished yet). The old algorithm was just an application of Ziv's method and thus had exponential time and memory complexity in the worst case, e.g., for the sum of a huge number and a tiny number. The new algorithm computes the sum by blocks, with reiterations needed only in case of cancellation, thereby taking such gaps between the exponents of the inputs into account.